A house's value is more than just location and square footage. Like the features that make up a person, an informed party wants to know all the aspects that give a house its value. For example, suppose you want to sell a house and do not know what price to ask: it cannot be too low or too high. To estimate it, you would typically look for similar properties in your neighbourhood and assess your house's price from the data you gather.
The proposed project determines the 'price' of a house based on all the features of the available dataset, not only the location and square footage. The problem statement also makes clear that the value of the house is predicted not only from the buyer's perspective but also from the seller's, using the given dataset to decide on the correct 'price' to tag the house with. To get a clear understanding of the given dataset, we started our work by analyzing it in an Excel workbook using filters, which gave some interesting insights.
The correlation between attributes is not clear from the Excel workbook; it can be seen using a Python notebook.
With this inference, it was easy to start our analysis of the given dataset.
Note: Since we have one target variable, 'price', and all other attributes are independent variables, we will train our model on the independent variables. This problem therefore falls under the supervised learning method.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import svm
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA
from scipy.stats import zscore
inn_df = pd.read_csv("innercity.csv")
inn_df
inn_df.describe()
inn_df.info()
inn_df = inn_df.drop(['cid'], axis=1)
inn_df = inn_df.drop(['dayhours'], axis=1)
The univariate analysis can be started by using different basic plots.
inn_df1=inn_df
from scipy.stats import zscore
inn_df1 =inn_df.apply(zscore)
inn_df1.head()
plt.figure(figsize=(10, 5))
sns.distplot(inn_df['price'], color='orange')
*The target column above is skewed to the right because of outliers.
*To check whether it is normally distributed, let us use the skewness and kurtosis tests.
from scipy.stats import skew
from scipy.stats import kurtosis
skew(inn_df['price'])
kurtosis(inn_df['price'])
plt.figure(figsize=(14, 5))
counts, bin_edges = np.histogram(inn_df['price'], bins=10, density=True)
plt.xlabel('price')
pdf= counts/(sum(counts))
print("pdf=", pdf);
print("bin_edges=", bin_edges);
cdf = np.cumsum(pdf)
print("cdf=", cdf);
plt.plot(bin_edges[1:],pdf);
plt.plot(bin_edges[1:],cdf);
*The orange plot shows the CDF. From the plot it is very clear that around 90% of the houses have a cumulative price below around 10,00,000, so the probability is that a house will be bought in and around that price range.
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['room_bed'], color='orange')
*From the plot we can see several clusters with uneven peaks, which suggests the distribution is not normal. This may be due to outliers or missing values in the data.
*When the data was analyzed using filters in Excel, we saw some zeros; when related to other features, those zeros appear to be missing values.
*Apart from that, we can see an outlier which, when related to other features, seems to be an error value.
plt.figure(figsize=(14, 5))
sns.distplot(inn_df['room_bath'], color='orange')
*room_bath denotes the number of bathrooms per bedroom, so it depends on the number of bedrooms. In a few cases the number of bathrooms is zero while room_bed has a positive value, which is not possible. So zero denotes a missing value, as mentioned above.
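A quick way to confirm this pattern is to count the rows where room_bath is zero while room_bed is positive; a minimal sketch on a toy frame (the values here are illustrative, not from the dataset):

```python
import pandas as pd

# Toy frame standing in for the dataset; values are illustrative only.
df = pd.DataFrame({
    "room_bed":  [3, 0, 4, 2],
    "room_bath": [2.0, 0.0, 0.0, 1.5],
})

# Rows with zero bathrooms but a positive bedroom count are likely
# missing values rather than genuine zero-bathroom houses.
suspect = df[(df["room_bath"] == 0) & (df["room_bed"] > 0)]
print(len(suspect))  # 1
```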
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['living_measure'], color='orange')
*living_measure denotes the total constructed area. Its distribution plot is also skewed to the right.
skew(inn_df['living_measure'])
kurtosis(inn_df['living_measure'])
plt.figure(figsize=(6, 5))
sns.distplot(inn_df['lot_measure'], color='orange')
*lot_measure is the square footage of the lot; its distribution is strongly right-skewed.
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['ceil'], color='orange')
ceil: denotes the total number of floors in the house. We can clearly see most houses have 1 or 2 floors, but a few have 1.5, 2.5, or 3.5 floors, which is confusing: is the data wrong, or do some houses have a half floor constructed? How can the number of levels or floors be a half?
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['coast'], color='orange')
*There are only a few houses near the coast.
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['sight'], color='orange')
*From this it is very clear that most houses were never viewed, and the maximum number of times a house was viewed is 4.
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['condition'], color='orange')
condition: how good the overall condition is. Every house is rated on a scale of 5. It is very clear that most houses have a rating of 3, some have 4 or 5, and very few are rated low.
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['quality'], color='orange')
*quality: grade given to the housing unit on a 1-13 grading system, with 13 being the better housing unit. Most housing units cluster around a grade of 7.
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['ceil_measure'], color='orange')
*ceil_measure: square footage of the house apart from the basement. It has a slightly right-skewed distribution.
skew(inn_df['ceil_measure'])
kurtosis(inn_df['ceil_measure'])
As the kurtosis is much greater than that of a normal distribution, the distribution cannot be considered normal.
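Skewness and kurtosis can also be combined into a single normality check; a hedged sketch using scipy's D'Agostino-Pearson test on synthetic data (not the dataset itself):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed_sample = rng.lognormal(size=2000)  # right-skewed, like 'price'

# normaltest (D'Agostino-Pearson) combines skewness and kurtosis into a
# single statistic; a small p-value rejects normality.
stat, p_value = stats.normaltest(skewed_sample)
print(p_value < 0.05)  # True: the right-skewed sample is clearly not normal
```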
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['basement'], color='orange')
*basement: square footage of the basement. The range varies from 0 to 4,820 sqft. From the plot, most houses seem to have little or no basement area.
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['yr_built'], color='orange')
yr_built: Built Year
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['yr_renovated'], color='orange')
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['zipcode'], color='orange')
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['lat'], color='orange')
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['long'], color='orange')
*The longitude, latitude, and zipcode columns, when analyzed, do not have a big impact on the target column 'price'.
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['living_measure15'], color='orange')
living_measure15: living room area in 2015 (implies some renovations). This might or might not have affected the lot-size area. Here the plot is close to a normal distribution.
plt.figure(figsize=(5, 5))
sns.distplot(inn_df['lot_measure15'], color='orange')
lot_measure15: lotSize area in 2015(implies-- some renovations)
plt.figure(figsize=(14, 5))
sns.distplot(inn_df['furnished'], color='orange')
furnished: a categorical variable indicating whether the house is furnished or not, with '0': not furnished and '1': furnished. Here we see many more unfurnished houses than furnished ones.
plt.figure(figsize=(6, 5))
sns.distplot(inn_df['total_area'], color='orange')
Comparing lot_measure and total_area, we see the same distribution in their histograms.
Across all the graphs, the PDF curve represents the distribution, similar to the histograms.
inn_df.boxplot(figsize = (20,5))
inn_df1=inn_df
from scipy.stats import zscore
inn_df1 =inn_df.apply(zscore)
inn_df1.head()
inn_df1.boxplot(figsize = (20,5))
Q1 = inn_df.quantile(0.25)
Q3 = inn_df.quantile(0.75)
IQR = Q3 - Q1
((inn_df < (Q1 - 1.5 * IQR)) | (inn_df > (Q3 + 1.5 * IQR))).sum()
sns.pairplot(inn_df)
correlation = inn_df.corr()
plt.figure(figsize=(20, 15))
sns.heatmap(correlation,annot=True, linewidth=0, vmin=-1)
All the furnished houses have better quality.
living_measure, quality, ceil_measure, furnished & room_bath have the strongest correlation with the target variable ('price'). So for predicting the price of a property, these attributes play a major role in this dataset.
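One way to extract such a ranking programmatically is to sort the absolute correlations of every feature against 'price'; a small sketch on a toy frame (column names borrowed from the dataset, values invented):

```python
import pandas as pd

# Toy frame showing how to rank features by absolute correlation with the target.
df = pd.DataFrame({
    "price":          [100, 200, 300, 400],
    "living_measure": [10, 20, 30, 40],   # strongly correlated with price
    "lot_measure":    [5, 3, 9, 1],       # weakly correlated with price
})
ranked = df.corr()["price"].drop("price").abs().sort_values(ascending=False)
print(ranked.index[0])  # living_measure
```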
y = inn_df.iloc[:,0]
inn = inn_df.iloc[:,1:]
from sklearn import model_selection
test_size = 0.30 # taking a 70:30 train/test split
seed = 7 # random number seed for repeatability
X_train, X_test, y_train, y_test = model_selection.train_test_split(inn, y, test_size=test_size, random_state=seed)
#Testing using Simple linear model
from sklearn.linear_model import LinearRegression
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)
regression_model.coef_
regression_model.intercept_
regression_model.score(X_train, y_train)
regression_model.score(X_test, y_test)
# Testing using Ridge & Lasso method
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
ridge = Ridge(alpha=.3)
ridge.fit(X_train, y_train)
print ("Ridge model:", (ridge.coef_))
lasso = Lasso(alpha=0.2)
lasso.fit(X_train, y_train)
print ("Lasso model:", (lasso.coef_))
print(ridge.score(X_train, y_train))
print(ridge.score(X_test, y_test))
print(lasso.score(X_train, y_train))
print(lasso.score(X_test, y_test))
The model score obtained is about 70% (R²) for both the ridge and lasso methods.
# Testing using Decision Tree Regressor
from sklearn.tree import DecisionTreeRegressor
d1 = DecisionTreeRegressor(max_depth = 10)
d1.fit(X_train, y_train)
d1.score(X_train, y_train)
d1.score(X_test, y_test)
# Testing using Gradient Boosting Method
from sklearn.ensemble import GradientBoostingRegressor
dt_gb = GradientBoostingRegressor()
dt_gb = dt_gb.fit(X_train, y_train)
test_pred2 = dt_gb.predict(X_test)
dt_gb.score(X_test, y_test)
# Testing using Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor
dt_rf = RandomForestRegressor()
dt_rf = dt_rf.fit(X_train, y_train)
test_pred3 = dt_rf.predict(X_test)
dt_rf.score(X_test, y_test)
The different iterations through which we try to improve our model score are:
- Removing the outliers using (mean + 3*SD) on the existing dataset and developing a model using the GB and RF methods.
- Replacing 'zeros' in the attributes that carry wrong information with their respective column medians and developing a model using GB and RF.
- Using PCA to find the minimum number of attributes needed to develop a better model using GB or RF that retains at least 95% of the variance.
- Analysing the model performance of GB and RF after removing attributes.
- Applying a polynomial function to the existing dataset and testing the model score using GB & RF.
- Changing the learning rate and the number of estimators in GB & RF.
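Since every score above comes from a single 70:30 split, it can vary with the random seed; k-fold cross-validation gives a steadier estimate. A sketch on synthetic data (the housing frame itself is not reproduced here):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the housing frame.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=7)

# 5-fold CV averages five different train/test splits, so the score is less
# sensitive to any single 70:30 split.
scores = cross_val_score(GradientBoostingRegressor(random_state=7), X, y, cv=5)
print(round(scores.mean(), 2))
```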
a = np.mean(inn_df, axis=0)
b = np.std(inn_df, axis=0)
d = np.median(inn_df, axis=0)
c = (a+(3*b))
d
inn_new = np.where(inn_df > c, c, inn_df)
inn_new1 = pd.DataFrame(inn_new)
inn_new1.boxplot(figsize = (25,5))
inn_new2 =inn_new1.apply(zscore)
inn_new2.head()
inn_new2.boxplot(figsize = (25,5))
From the plot it is clear that the number of outliers in each attribute has been reduced. But to get a clearer view, let us count the outliers for each attribute.
Q1 = inn_new1.quantile(0.25)
Q3 = inn_new1.quantile(0.75)
IQR = Q3 - Q1
((inn_new1 < (Q1 - 1.5 * IQR)) | (inn_new1 > (Q3 + 1.5 * IQR))).sum()
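Besides recounting IQR outliers, one can directly count how many cells the mean + 3·SD cap actually clipped; a toy-column sketch (np.std defaults to ddof=0, matching the capping cell above; the values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy column with one extreme value (illustrative, not dataset values).
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 100.0]})
c = df.mean() + 3 * df.std(ddof=0)    # mean + 3*SD cap per column
n_capped = int((df > c).sum().sum())  # cells the cap will clip
capped = pd.DataFrame(np.where(df > c, c, df), columns=df.columns)
print(n_capped)  # 1
```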
# Let us now test the obtained dataset using GB and RF to find whether this has any effect on the performance of the model
y1 = inn_new1.iloc[:,0]
inn1 = inn_new1.iloc[:,1:]
test_size = 0.30 # taking a 70:30 train/test split
seed = 7 # random number seed for repeatability
X_train, X_test, y_train, y_test = model_selection.train_test_split(inn1, y1, test_size=test_size, random_state=seed)
from sklearn.ensemble import GradientBoostingRegressor
dt_gb = GradientBoostingRegressor()
dt_gb = dt_gb.fit(X_train, y_train)
test_pred2 = dt_gb.predict(X_test)
dt_gb.score(X_test, y_test)
from sklearn.ensemble import RandomForestRegressor
dt_rf = RandomForestRegressor()
dt_rf = dt_rf.fit(X_train, y_train)
test_pred3 = dt_rf.predict(X_test)
dt_rf.score(X_test, y_test)
mean_room_bath = inn1['room_bath'].mean(skipna=True)
inn3=inn1.replace({'room_bath': {0: mean_room_bath}})
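The iteration list earlier mentions replacing zeros with column medians, while the cell above uses the mean; a median-based variant could look like this (toy values, illustrative only):

```python
import pandas as pd

# Toy column (illustrative values) for the median-based replacement.
df = pd.DataFrame({"room_bath": [0.0, 1.0, 2.0, 2.5, 0.0]})

# Median of the non-zero entries, so the zero placeholders don't drag it down.
med = df.loc[df["room_bath"] != 0, "room_bath"].median()
df_fixed = df.replace({"room_bath": {0: med}})
print(df_fixed["room_bath"].tolist())  # [2.0, 1.0, 2.0, 2.5, 2.0]
```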
# Let us now test the obtained dataset using GB and RF to find whether this has any effect on the performance of the model
test_size = 0.30 # taking a 70:30 train/test split
seed = 7 # random number seed for repeatability
X_train, X_test, y_train, y_test = model_selection.train_test_split(inn3, y1, test_size=test_size, random_state=seed)
from sklearn.ensemble import GradientBoostingRegressor
dt_gb = GradientBoostingRegressor()
dt_gb = dt_gb.fit(X_train, y_train)
test_pred2 = dt_gb.predict(X_test)
dt_gb.score(X_test, y_test)
from sklearn.ensemble import RandomForestRegressor
dt_rf = RandomForestRegressor()
dt_rf = dt_rf.fit(X_train, y_train)
test_pred3 = dt_rf.predict(X_test)
dt_rf.score(X_test, y_test)
test_size = 0.30 # taking a 70:30 train/test split
seed = 7 # random number seed for repeatability
X_train, X_test, y_train, y_test = model_selection.train_test_split(inn1, y1, test_size=test_size, random_state=seed)
from sklearn.ensemble import GradientBoostingRegressor
dt_gb = GradientBoostingRegressor(n_estimators = 290, learning_rate=0.22)
dt_gb = dt_gb.fit(X_train, y_train)
test_pred2 = dt_gb.predict(X_test)
dt_gb.score(X_test, y_test)
from sklearn.ensemble import RandomForestRegressor
dt_rf = RandomForestRegressor(n_estimators = 290, max_depth = 100)
dt_rf = dt_rf.fit(X_train, y_train)
test_pred3 = dt_rf.predict(X_test)
dt_rf.score(X_test, y_test)
inn2=inn1.apply(zscore)
cov_matrix = np.cov(inn2.T)
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)
tot = sum(eig_vals)
var_exp = [( i /tot ) * 100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print("Cumulative Variance Explained", cum_var_exp)
plt.figure(figsize=(5,5))
plt.plot(var_exp)
plt.figure(figsize=(10 , 5))
plt.bar(range(1, eig_vals.size + 1), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(1, eig_vals.size + 1), cum_var_exp, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()
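The manual eigendecomposition above can be cross-checked with sklearn's PCA (already imported at the top of the notebook); a sketch on synthetic data with one nearly redundant column:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 5))
X[:, 4] = X[:, 0] + 0.01 * rng.normal(size=200)  # near-duplicate of column 0

pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)
# With one redundant column, 4 of the 5 components explain almost all variance.
print(cum[3] > 0.95)  # True
```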
# Sort eigenvalues in descending order
# Make a set of (eigenvalue, eigenvector) pairs
eig_pairs = [(eig_vals[index], eig_vecs[:,index]) for index in range(len(eig_vals))]
# Sort the (eigenvalue, eigenvector) pairs from highest to lowest with respect to eigenvalue
eig_pairs.sort()
eig_pairs.reverse()
print(eig_pairs)
# Extract the descending ordered eigenvalues and eigenvectors
eigvalues_sorted = [eig_pairs[index][0] for index in range(len(eig_vals))]
eigvectors_sorted = [eig_pairs[index][1] for index in range(len(eig_vals))]
P_reduce = np.array(eigvectors_sorted[0:17]) # keeping the top 17 components
inn_4D = np.dot(inn2,P_reduce.T) # projecting the original data onto the principal components
inn_data_df = pd.DataFrame(inn_4D) # converting the array to a dataframe for modelling
test_size = 0.30 # taking a 70:30 train/test split
seed = 7 # random number seed for repeatability
X_train, X_test, y_train, y_test = model_selection.train_test_split(inn_data_df, y1, test_size=test_size, random_state=seed)
from sklearn.ensemble import GradientBoostingRegressor
dt_gb = GradientBoostingRegressor(n_estimators = 290, learning_rate=0.22)
dt_gb = dt_gb.fit(X_train, y_train)
test_pred2 = dt_gb.predict(X_test)
dt_gb.score(X_test, y_test)
inn_df = pd.read_csv("innercity.csv")
inn_df = inn_df.drop(['cid'], axis=1)
inn_df = inn_df.drop(['dayhours'], axis=1)
#inn_df = inn_df.drop(['room_bed'], axis=1)
#inn_df = inn_df.drop(['room_bath'], axis=1)
#inn_df = inn_df.drop(['living_measure'], axis=1)
inn_df = inn_df.drop(['lot_measure'], axis=1)
#inn_df = inn_df.drop(['ceil'], axis=1)
#inn_df = inn_df.drop(['coast'], axis=1)
#inn_df = inn_df.drop(['sight'], axis=1)
#inn_df = inn_df.drop(['condition'], axis=1)
#inn_df = inn_df.drop(['quality'], axis=1)
#inn_df = inn_df.drop(['ceil_measure'], axis=1)
inn_df = inn_df.drop(['basement'], axis=1)
#inn_df = inn_df.drop(['yr_built'], axis=1)
#inn_df = inn_df.drop(['yr_renovated'], axis=1)
#inn_df = inn_df.drop(['zipcode'], axis=1)
#inn_df = inn_df.drop(['lat'], axis=1)
#inn_df = inn_df.drop(['long'], axis=1)
#inn_df = inn_df.drop(['living_measure15'], axis=1)
#inn_df = inn_df.drop(['lot_measure15'], axis=1)
inn_df = inn_df.drop(['furnished'], axis=1)
#inn_df = inn_df.drop(['total_area'], axis=1)
a = np.mean(inn_df, axis=0)
b = np.std(inn_df, axis=0)
c = (a+(3*b))
inn_new = np.where(inn_df > c, c, inn_df)
inn_new1 = pd.DataFrame(inn_new)
y2 = inn_new1.iloc[:,0]
inn2 = inn_new1.iloc[:,1:]
test_size = 0.30 # taking a 70:30 train/test split
seed = 7 # random number seed for repeatability
X_train, X_test, y_train, y_test = model_selection.train_test_split(inn2, y2, test_size=test_size, random_state=seed)
from sklearn.ensemble import GradientBoostingRegressor
dt_gb = GradientBoostingRegressor(n_estimators = 290, learning_rate=0.22)
dt_gb = dt_gb.fit(X_train, y_train)
test_pred2 = dt_gb.predict(X_test)
dt_gb.score(X_test, y_test)
Removing other columns along with dayhours and cid to find out which columns help improve the model performance.
Removing two or more attributes:
- lot_measure & basement: the model score was 89.983%
- lot_measure, lot_measure15 & basement: the model score was 89.936%
- lot_measure, furnished & basement: the model score was 89.989%
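The manual drop/comment-out experiments above can be automated by looping over column subsets; a hedged sketch on synthetic named columns (the real frame and its scores are not reproduced here):

```python
from itertools import combinations

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in with named columns; column names "a".."d" are invented.
X, y = make_regression(n_samples=200, n_features=4, random_state=7)
df = pd.DataFrame(X, columns=["a", "b", "c", "d"])

# Score the model after dropping each pair of columns.
results = {}
for pair in combinations(df.columns, 2):
    X_sub = df.drop(list(pair), axis=1)
    X_tr, X_te, y_tr, y_te = train_test_split(X_sub, y, test_size=0.3, random_state=7)
    results[pair] = GradientBoostingRegressor(random_state=7).fit(X_tr, y_tr).score(X_te, y_te)
print(len(results))  # 6 pairs from 4 columns
```

The pair with the highest score then identifies the least informative columns to drop.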